feat: Adding support for shared-requestor #104


Open · @heyvister1 wants to merge 1 commit into base: main from joint-requestor

Conversation

@heyvister1 (Contributor) commented on Jun 13, 2025

The motivation for this change is to let multiple requestors share the same node-maintenance object(s), so that required node operations are synchronized efficiently. This reduces consecutive node operation cycles to a single operation cycle.

The first requestor to create a node-maintenance obj is its owner and is in charge of creating/deleting the obj.
Subsequent requestors check for an existing node-maintenance obj and, if found, add themselves under the spec.additionalRequestors list.
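
For illustration, a minimal sketch of what a shared node-maintenance object could look like under this scheme. The `RequestorID`/`AdditionalRequestors` field names come from the `spec.requestorID`/`spec.additionalRequestors` fields discussed in this PR; the requestor ID values and object name are hypothetical:

```go
// Sketch only: assumes the maintenancev1alpha1 API referenced in this PR,
// with AdditionalRequestors as a string list.
nm := &maintenancev1alpha1.NodeMaintenance{
	ObjectMeta: metav1.ObjectMeta{
		Name:      "node-maintenance-worker-1", // hypothetical name
		Namespace: "default",
	},
	Spec: maintenancev1alpha1.NodeMaintenanceSpec{
		// The first requestor to create the object is its owner and is
		// in charge of creating/deleting it.
		RequestorID: "nvidia.network.operator", // hypothetical ID
		// Later requestors append themselves here instead of creating
		// a second object for the same node.
		AdditionalRequestors: []string{"nvidia.gpu.operator"}, // hypothetical ID
	},
}
```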

copy-pr-bot (bot) commented on Jun 13, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@heyvister1 force-pushed the joint-requestor branch 4 times, most recently from 4b4c42c to e6c5941 on June 15, 2025 at 13:01
@@ -108,6 +108,21 @@ controller manager watchers:
builder.WithPredicates(requestor.NewConditionChangedPredicate(setupLog,
requestorOpts.MaintenanceOPRequestorID))).
```

The requestor mode supports a `joint-requestor` flow where multiple operators can coordinate node maintenance operations:
@adrianchiris (Collaborator) commented on Jun 16, 2025:

nit: any chance to modify the terminology to: shared-requestor or shared-node-maintenance?

Collaborator:

consider putting this in a sub-section of requestor

Assumptions:
1. A cluster admin who requires shared-requestor mode must make sure that all operators utilizing maintenance OP use the same upgrade policy specs (same drainSpec).
2. To accommodate both GPU and network driver upgrades, `DrainSpec.PodSelector` should be set accordingly (hard-coded):
   * podSelector: `nvidia.com/ofed-driver-upgrade-drain.skip!=true,nvidia.com/gpu-driver-upgrade-drain.skip!=true`
Collaborator:

just making sure, drain pkg will treat this as an OR, right?

@heyvister1 (author):

It will be treated as an AND condition.
Check the following example (I have tested it):

1. pod "foo" - nvidia.com/ofed-driver-upgrade-drain.skip=true label
2. pod "bar" - nvidia.com/gpu-driver-upgrade-drain.skip=true label
3. pod "baz" - no label

Pod "foo" (has nvidia.com/ofed-driver-upgrade-drain.skip=true):
nvidia.com/ofed-driver-upgrade-drain.skip!=true → FALSE (because it equals true)
nvidia.com/gpu-driver-upgrade-drain.skip!=true → TRUE (because this label doesn't exist, so it's not equal to true)
Result: FALSE AND TRUE = FALSE

Pod "bar" (has nvidia.com/gpu-driver-upgrade-drain.skip=true):
nvidia.com/ofed-driver-upgrade-drain.skip!=true → TRUE (because this label doesn't exist, so it's not equal to true)
nvidia.com/gpu-driver-upgrade-drain.skip!=true → FALSE (because it equals true)
Result: TRUE AND FALSE = FALSE

Pod "baz" (no labels):
nvidia.com/ofed-driver-upgrade-drain.skip!=true → TRUE (because this label doesn't exist, so it's not equal to true)
nvidia.com/gpu-driver-upgrade-drain.skip!=true → TRUE (because this label doesn't exist, so it's not equal to true)
Result: TRUE AND TRUE = TRUE

Collaborator:

ack, thx for the clarification. So a pod is skipped (excluded from drain) exactly when it matches:

nvidia.com/ofed-driver-upgrade-drain.skip=true OR nvidia.com/gpu-driver-upgrade-drain.skip=true
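
A quick way to verify this AND semantics outside the cluster, using the upstream label-selector parser (a standalone sketch; the pods mirror the example above):

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/labels"
)

func main() {
	// Comma between requirements means AND: a pod matches only if
	// every requirement holds for its labels.
	sel, err := labels.Parse(
		"nvidia.com/ofed-driver-upgrade-drain.skip!=true," +
			"nvidia.com/gpu-driver-upgrade-drain.skip!=true")
	if err != nil {
		panic(err)
	}

	foo := labels.Set{"nvidia.com/ofed-driver-upgrade-drain.skip": "true"}
	bar := labels.Set{"nvidia.com/gpu-driver-upgrade-drain.skip": "true"}
	baz := labels.Set{} // no labels

	fmt.Println(sel.Matches(foo)) // false: first requirement fails
	fmt.Println(sel.Matches(bar)) // false: second requirement fails
	fmt.Println(sel.Matches(baz)) // true: neither skip label is set
}
```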

if _, exists := nm.Labels[GetUpgradeRequestorLabelKey(m.opts.MaintenanceOPRequestorID)]; !exists {
nm.Labels[GetUpgradeRequestorLabelKey(m.opts.MaintenanceOPRequestorID)] = trueString
}
err := m.K8sClient.Patch(ctx, nm, client.MergeFrom(originalNm))
@adrianchiris (Collaborator) commented on Jun 16, 2025:

you say in comments that optimistic lock should be used (rightfully so), but I don't see you using:

client.MergeFromWithOptions(originalNm, client.MergeFromWithOptimisticLock{})

in patch calls here and below

@heyvister1 (author):

This was missing in my latest commit
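
For reference, a minimal sketch of the patch call with optimistic locking enabled, reusing the helpers visible in this diff (`GetUpgradeRequestorLabelKey`, `trueString`):

```go
// Snapshot the object before mutating it so the patch diff is computed
// against the original state.
originalNm := nm.DeepCopy()

if nm.Labels == nil {
	nm.Labels = make(map[string]string)
}
nm.Labels[GetUpgradeRequestorLabelKey(m.opts.MaintenanceOPRequestorID)] = trueString

// MergeFromWithOptimisticLock embeds the resourceVersion in the patch,
// so a concurrent update by another requestor fails with a conflict
// instead of being silently overwritten.
patch := client.MergeFromWithOptions(originalNm, client.MergeFromWithOptimisticLock{})
if err := m.K8sClient.Patch(ctx, nm, patch); err != nil {
	return fmt.Errorf("failed to patch nodeMaintenance %s: %w",
		client.ObjectKeyFromObject(nm), err)
}
```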

@@ -306,6 +324,114 @@ func (m *RequestorNodeStateManagerImpl) ProcessUpgradeRequiredNodes(

return nil
}
func (m *RequestorNodeStateManagerImpl) createOrUpdateNodeMaintenance(ctx context.Context,
nodeState *NodeUpgradeState) error {
// 1. check if nodeMaintenance obj exists.
Collaborator:

are these comments relevant in createOrUpdateNodeMaintenance?

Collaborator:

maybe we need to move some of it to the docs?

@@ -438,12 +557,12 @@ func GetRequestorOptsFromEnvs() RequestorOptions {
if os.Getenv("MAINTENANCE_OPERATOR_REQUESTOR_ID") != "" {
opts.MaintenanceOPRequestorID = os.Getenv("MAINTENANCE_OPERATOR_REQUESTOR_ID")
} else {
-	opts.MaintenanceOPRequestorID = "nvidia.operator.com"
+	opts.MaintenanceOPRequestorID = DefaultNodeMaintenanceRequestorID
Collaborator:

it's not really a great UX IMO to enable this feature when the two env vars are unset.

@heyvister1 (author):

I'll be checking if DefaultNodeMaintenanceRequestorID is used before using the shared-requestor flow
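
A sketch of what that guard could look like (the helper name is hypothetical; `RequestorOptions` and `DefaultNodeMaintenanceRequestorID` are from this PR):

```go
// validateSharedRequestorOpts is a hypothetical helper: refuse to run
// the shared-requestor flow while the operator is still using the
// library's default requestor ID.
func validateSharedRequestorOpts(opts RequestorOptions) error {
	if opts.MaintenanceOPRequestorID == DefaultNodeMaintenanceRequestorID {
		return fmt.Errorf("shared-requestor flow requires an explicit "+
			"MAINTENANCE_OPERATOR_REQUESTOR_ID, got default %q",
			DefaultNodeMaintenanceRequestorID)
	}
	return nil
}
```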

@heyvister1 requested a review from adrianchiris on June 19, 2025 at 05:59
@heyvister1 changed the title from "feat: Adding support for joint requestors" to "feat: Adding support for shared-requestor" on Jun 19, 2025
@heyvister1 (author) commented:

@cdesiniotis, @tariq1890 please help to review this PR

@heyvister1 force-pushed the joint-requestor branch 4 times, most recently from b062823 to 3eaf949 on June 19, 2025 at 14:22
1. Each operator adds its dedicated operator label to the nodeMaintenance object
2. When a nodeMaintenance object exists, additional operators append their requestorID to the spec.AdditionalRequestors list
3. During `uncordon-required` completion:
- Non-owning operators remove themselves from spec.AdditionalRequestors list using optimistic locking
Collaborator:

who is the "owning operator"?

it is the operator that managed to create the NodeMaintenance object, which means for a given NodeMaintenance obj (the obj name is the same in shared-requestor mode for all cooperating operators) it's the operator whose RequestorID is set in spec.requestorID.

who is a "non-owning" operator?

it's the operator that did not create the NodeMaintenance object, which means for a given NodeMaintenance obj it's the operator whose RequestorID is present in spec.AdditionalRequestors.

i would define these two roles and how each is determined according to the content of the NodeMaintenance obj.
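
A small sketch of how those two roles could be derived from the object's content (helper names are hypothetical; the fields are the ones named above):

```go
// The owner is the operator whose ID was set at creation time in
// spec.requestorID; it alone creates and deletes the object.
func isOwner(nm *maintenancev1alpha1.NodeMaintenance, requestorID string) bool {
	return nm.Spec.RequestorID == requestorID
}

// Non-owning operators only appear in spec.additionalRequestors and, on
// uncordon completion, remove themselves from that list rather than
// deleting the object.
func isAdditionalRequestor(nm *maintenancev1alpha1.NodeMaintenance, requestorID string) bool {
	for _, id := range nm.Spec.AdditionalRequestors {
		if id == requestorID {
			return true
		}
	}
	return false
}
```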

@@ -130,6 +130,10 @@ func (m *InplaceNodeStateManagerImpl) ProcessUncordonRequiredNodes(
if nodeState.NodeMaintenance != nil {
continue
}
// check if node upgrade is handled by requestor mode, if so node uncordon will be performed by requestor flow
Collaborator:

should we remove nodeState.NodeMaintenance != nil check?

@@ -120,8 +124,18 @@ func (p ConditionChangedPredicate) Update(e event.TypedUpdateEvent[client.Object
return false
}

- // check for matching requestor ID
- if newO.Spec.RequestorID != p.requestorID {
+ // check if requestor label exists, if not ignore event
Collaborator:

can we skip this label and just check if the requestorID is in spec.requestorID or in spec.additionalRequestors list ?
one less label we need to worry about.

Collaborator:

also would be great to have a separate predicate for checking if the nodeMaintenance is relevant for the requestor.

Collaborator:

you can use controller-runtime/pkg/predicate:NewPredicateFuncs() instead of defining a new struct + implementing all callbacks.
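
Combining the two suggestions above, a sketch of such a predicate (assuming `AdditionalRequestors` is a string list; the `maintenancev1alpha1` import is the one used elsewhere in this package):

```go
import (
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// relevantForRequestor keeps only NodeMaintenance events where this
// requestor is the owner or is listed under spec.additionalRequestors,
// without relying on an extra label.
func relevantForRequestor(requestorID string) predicate.Funcs {
	return predicate.NewPredicateFuncs(func(obj client.Object) bool {
		nm, ok := obj.(*maintenancev1alpha1.NodeMaintenance)
		if !ok {
			return false
		}
		if nm.Spec.RequestorID == requestorID {
			return true
		}
		for _, id := range nm.Spec.AdditionalRequestors {
			if id == requestorID {
				return true
			}
		}
		return false
	})
}
```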

@@ -48,6 +48,10 @@ const (
MaintenanceOPEvictionGPU = "nvidia.com/gpu-*"
// MaintenanceOPEvictionRDMA is a default filter for Network OP pods eviction
MaintenanceOPEvictionRDMA = "nvidia.com/rdma*"
// DefaultNodeMaintenanceRequestorID is the default requestor ID for NVIDIA operators in the nodeMaintenance object
DefaultNodeMaintenanceRequestorID = "nvidia.operator.com"
Collaborator:

consider removing the default requestor ID, i think every consumer of this lib (an operator) should provide its own ID. less risk of defaulting to the same ID.

@@ -165,13 +179,16 @@ func SetDefaultNodeMaintenance(opts RequestorOptions,
func (m *RequestorNodeStateManagerImpl) NewNodeMaintenance(nodeName string) *maintenancev1alpha1.NodeMaintenance {
nm := defaultNodeMaintenance.DeepCopy()
nm.Name = m.getNodeMaintenanceName(nodeName)
// mark nodeMaintenance object as updated by the requestor
Collaborator:

as discussed, i think we can drop this label?

return nil
}

m.Log.V(consts.LogLevelInfo).Info("appending new requestor", nm.Spec.RequestorID, "under 'AdditionalRequestors' list")
Collaborator:

nit: restructure this and use m.opts.MaintenanceOPRequestorID

m.Log.V(consts.LogLevelInfo).Info("appending new requestor under AdditionalRequestors", "requestor", m.opts.MaintenanceOPRequestorID, "nodeMaintenance", client.ObjectKeyFromObject(nm))

if nm.Labels == nil {
nm.Labels = make(map[string]string)
}
// only set label if it doesn't exist
Collaborator:

nit: we can drop the label?

// check if object is owned by deleting requestor, if so proceed to deletion
if nm.Spec.RequestorID == m.opts.MaintenanceOPRequestorID {
m.Log.V(consts.LogLevelInfo).Info("deleting node maintenance",
nodeState.NodeMaintenance.GetName(), nodeState.NodeMaintenance.GetNamespace())
Collaborator:

nit: m.Log.V().Info("....", "nodeMaintenance", client.ObjectKeyFromObject(nodeState.NodeMaintenance))

nodeState.NodeMaintenance.GetName(), nodeState.NodeMaintenance.GetNamespace())
err := m.deleteNodeMaintenance(ctx, nodeState)
if err != nil {
m.Log.V(consts.LogLevelWarning).Error(
Collaborator:

nit: "failed to delete NodeMaintenance, node uncordon failed", "nodeMaintenance", client.ObjectKeyFromObject(nodeState.NodeMaintenance)

})
// remove requestorID label from the object
// only remove label if it exists
if nm.Labels != nil {
Collaborator:

nit: drop the label?

patch := client.MergeFromWithOptions(originalNm, client.MergeFromWithOptimisticLock{})
err := m.K8sClient.Patch(ctx, nm, patch)
if err != nil {
return fmt.Errorf("failed to update nodeMaintenance. %v", err)
Collaborator:

nit: let's be a bit more verbose: "failed to remove requestor from additionalRequestors. failed to patch nodeMaintenance %s. %w", client.ObjectKeyFromObject(nodeState.NodeMaintenance), err

m.Log.V(consts.LogLevelDebug).Info("deleting node maintenance",
nodeState.NodeMaintenance.GetName(), nodeState.NodeMaintenance.GetNamespace())
- // skip in case node undergoes uncordon by inplace flow
+ // skip in case node undergoes uncordon by in-place flow
Collaborator:

i think the flow here should be:

func (m *RequestorNodeStateManagerImpl) ProcessUncordonRequiredNodes(
	ctx context.Context, currentClusterState *ClusterUpgradeState) error {
	m.Log.V(consts.LogLevelInfo).Info("ProcessUncordonRequiredNodes")

	for _, nodeState := range currentClusterState.NodeStates[UpgradeStateUncordonRequired] {
		// check if node upgrade is handled by requestor mode; if not, it's handled by the in-place flow so there is nothing to do
		if _, exists := nodeState.Node.Annotations[GetUpgradeRequestorModeAnnotationKey()]; !exists {
			continue
		}

		if nodeState.NodeMaintenance == nil {
			// only when nodeMaintenance obj is deleted, node state should be updated to 'upgrade-done'
			err := m.NodeUpgradeStateProvider.ChangeNodeUpgradeState(ctx, nodeState.Node, UpgradeStateDone)
			if err != nil {
				m.Log.V(consts.LogLevelError).Error(
					err, "Failed to change node upgrade state", "state", UpgradeStateDone)
				return err
			}
			// remove requestor upgrade annotation
			err = m.NodeUpgradeStateProvider.ChangeNodeUpgradeAnnotation(ctx,
				nodeState.Node, GetUpgradeRequestorModeAnnotationKey(), "null")
			if err != nil {
				return fmt.Errorf("failed to remove '%s' annotation from node %s. %w", GetUpgradeRequestorModeAnnotationKey(), nodeState.Node.Name, err)
			}
			// continue with the remaining nodes rather than returning early
			continue
		}
		err := m.deleteOrUpdateNodeMaintenance(ctx, nodeState)
		if err != nil {
			m.Log.V(consts.LogLevelWarning).Error(
				err, "failed to delete or update NodeMaintenance. Node uncordon failed",
				"nodeMaintenance", client.ObjectKeyFromObject(nodeState.NodeMaintenance), "node", nodeState.Node.Name)
			return err
		}
	}
	return nil
}

in case the node's requestor-mode annotation does not exist, it's fully handled by the in-place flow (i.e. uncordon node and move to upgrade state done)
